
Change from TextFileReader to ParquetStreamReader#348

Open
JanWillruth wants to merge 49 commits into glamod:main from JanWillruth:reader_io

Conversation

@JanWillruth
Collaborator

@JanWillruth JanWillruth commented Jan 15, 2026

To do

  • Change from TextFileReader to ParquetStreamReader for (better) handling of files larger than RAM when chunk_size is specified
  • Rework affected Databundle code
  • Move ParquetStreamReader from mdf_reader.utils.utilities to common.iterators
  • Rework affected cdm_mapper.mapper code
  • Rework affected common.select code
  • Rework affected common.inspect code
  • Rework affected metmetpy.validate code
  • Rework affected metmetpy.correct code
  • Rework affected mdf_reader.utilities.utils code
  • Rework affected mdf_reader.writer code
  • Rework affected core._utilities._copy code
  • Add ParquetStreamReader option to common.replace code
  • Remove common.pandas_TextParser_hdlr
  • clean up / update testing suite -> see re-work testing suite #365
  • do not use 'make_parser' in testing suite -> see re-work testing suite #365
  • Rework docstrings -> see enforce mypy hook #368
  • Rework type hints -> see enforce mypy hook #368

Ideas

  • Write a decorator so that a function can be called in the "normal" way, while the decorator decides whether to execute the function directly or pass it to common.iterators.process_disk_backed

    • see _apply_or_chunk, but as a decorator
    • e.g. both calls are valid: my_func(DataFrame, *args, **kwargs) and my_func(ParquetStreamReader, *args, **kwargs)
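
The decorator idea above could look roughly like the following minimal sketch. It is not code from this PR: ChunkStream is a hypothetical stand-in for ParquetStreamReader, process_stream is a hypothetical stand-in for common.iterators.process_disk_backed (whose real signature and behaviour may differ), and a list of dicts stands in for a DataFrame so that the example runs without pandas.

```python
from functools import wraps
from typing import Any, Callable, Iterable, Iterator


class ChunkStream:
    """Hypothetical stand-in for ParquetStreamReader: a one-pass iterator over chunks."""

    def __init__(self, chunks: Iterable[Any]) -> None:
        self._chunks = iter(chunks)

    def __iter__(self) -> Iterator[Any]:
        return self

    def __next__(self) -> Any:
        return next(self._chunks)


def process_stream(func: Callable[..., Any], stream: ChunkStream, *args: Any, **kwargs: Any) -> ChunkStream:
    """Hypothetical stand-in for common.iterators.process_disk_backed.

    Applies func lazily to every chunk; the real helper may do more
    (for example, writing intermediate results to disk).
    """
    return ChunkStream(func(chunk, *args, **kwargs) for chunk in stream)


def chunk_aware(func: Callable[..., Any]) -> Callable[..., Any]:
    """Let func accept either one in-memory frame or a stream of chunks."""

    @wraps(func)
    def wrapper(data: Any, *args: Any, **kwargs: Any) -> Any:
        if isinstance(data, ChunkStream):
            # Chunked input: delegate to the chunk-wise helper.
            return process_stream(func, data, *args, **kwargs)
        # In-memory input: call the function the "normal" way.
        return func(data, *args, **kwargs)

    return wrapper


@chunk_aware
def keep_positive(rows: list, column: str) -> list:
    """Toy transformation; a list of dicts stands in for a DataFrame."""
    return [row for row in rows if row[column] > 0]


rows = [{"sst": 1.5}, {"sst": -2.0}, {"sst": 0.3}]

# Same function, two call styles.
in_memory = keep_positive(rows, "sst")
streamed = list(keep_positive(ChunkStream([rows[:2], rows[2:]]), "sst"))
```

With this shape, the decorated function only ever sees a single frame, and the dispatch on the input type lives in one place. Whether the chunked branch should return another stream (as here) or a concatenated result is a design choice the sketch leaves open.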

Issues

This PR addresses opened issues:

…ng of files larger than RAM when chunk_size is specified; Rework affected Databundle code
@github-actions

Warning
This Pull Request is coming from a fork and must be manually tagged approved
in order to perform additional testing.

…ames; Remove unneeded TextFileReader tests from test_pandas.py
@ludwiglierhammer
Collaborator

@JanWillruth: I ran some high-performance tests. This PR does not really affect the maximum memory usage, but it speeds up the code a little bit. Nevertheless, we should use this PR for readability reasons.

Do you want to move the ParquetStreamReader to common? Then we would have the ParquetStreamReader in one general place and could make use of it in common.select, mdf_reader and cdm_mapper.

@ludwiglierhammer ludwiglierhammer mentioned this pull request Jan 28, 2026
2 tasks
@ludwiglierhammer
Collaborator

I merged #360 into the main branch. Please resolve the merge conflicts and we can focus on this PR again.

JanWillruth and others added 3 commits January 29, 2026 11:57
# Conflicts:
#	cdm_reader_mapper/mdf_reader/utils/utilities.py
#	tests/test_reader_utilities.py
@ludwiglierhammer
Collaborator

Hi @jtsiddons, what do you think about this PR? Do you have any ideas for improvements or general comments?

@jtsiddons
Collaborator

> Hi @jtsiddons, what do you think about this PR? Do you have any ideas for improvements or general comments?

Thanks @ludwiglierhammer - I've scheduled some time to have a look this afternoon

@ludwiglierhammer
Collaborator

ludwiglierhammer commented Feb 13, 2026

Hi @jtsiddons, we could replace all TextFileReader elements with the new ParquetStreamReader. Do you have any further suggestions for this PR? We would appreciate your review.



Development

Successfully merging this pull request may close these issues.

  • Refactor logic to handle chunking outside of reader(/mapper) for readability/maintenance
  • TextFileReader needed?

3 participants
